Add legal hold membership to device reporting #192

maddie-vargo · 2020-12-29T15:13:05Z

Description of Change

This PR addresses Phase I of #176 to add an --include-devices option to the legal-hold command to show devices associated with the legal hold custodians captured by the query. This option will also total storage held by these devices and print it in a table grouped by organization.

This PR requires pandas as a dependency. Python 3.5 does not support the pandas library, so this should not be merged until support ends for 3.5.

Testing Procedure

code42 legal-hold show 1234 --include-devices

PR Checklist
Did you remember to do the below?

Add unit tests to verify this change
Add an entry to CHANGELOG.md describing this change
Add docstrings for any new public parameters / methods / classes

github-actions · 2020-12-29T15:13:21Z

CLA Assistant Lite bot All contributors have signed the CLA ✍️

maddie-vargo · 2020-12-29T15:13:58Z

I have read the CLA Document and I hereby sign the CLA

antazoey · 2021-01-04T17:10:20Z

CHANGELOG.md

@@ -34,6 +34,9 @@ how a consumer would use the library (e.g. adding unit tests, updating documenta
    - `search` to search for audit-logs.
    - `send-to` to send audit-logs to server.

+- `code42 legal-hold show` option:


We do have a new devices list command that is merged by not yet released.

Would it make more sense to have a command code42 devices list --matter-id <UID> instead?

Just want to make sure all the options are considered, I don't really understand the use-case.

That's an interesting point. I chose to add it to legal-hold show because that command already showed details on a legal hold matter object, including users on legal hold and (optional) preservation policy details. I thought showing devices on legal hold would be a logical fit here.

From what I gather, the devices command renders mostly the Computer api resource. Unfortunately, there is no flag in the Computer api response that states whether a device is on legal hold. In order to pull legal hold devices, you first need to pull legal hold membership to get a list of users and then find the devices associated with those users. It would not be impossible to move this into the devices command, but we would still have to run some sort of legal hold py42 resource to determine the scope of devices.

Pinging @ceciliastevens for input as well.

Or a code42 legal-hold list-devices [MATTER-ID]. Trying to find a way to afford output format options (At least table and csv).

I agree, if we can find a way to structure this to add a format option, that would be helpful to most customers.

Currently, the legal-hold show command outputs data in the following formats:

Matter info - table

Users - columns

(optional) Preservation policy - JSON

What do you think about removing users from the show command and creating code42 legal-hold list-users [Matter ID] with an --include-devices optiion? I think it makes sense to keep users and devices closely tied together, rather than in separate commands. From a code perspective, to get devices, you'll have to get users first, anyway.

I like that! @timabrmsn What are you thoughts on this? I believe you were the one who implemented the show command.

I actually think it makes more sense to add legal hold status as an option to the devices list command, since that would be a useful option for people who are creating a report of all their devices to know which ones are on legal hold quickly when taking inventory of other things. It could just add a legalHoldMatterId column to the output.

And keeping users under legal-hold show still makes sense to me, since it's a high level overview of the matter, of which membership is a key detail. Otherwise the show command just becomes a single row of the list command with the option to also show preservation policy.

devices list already has --include-usernames which makes a separate call to api/Users and then merges the DataFrames on userId, so adding legal hold status would be easy enough to add using the same logic.

The storage total report is an interesting idea, but might be better suited to include in devices list report also, so it can be used more generically and not tied directly to only legal hold. I'm thinking something like --include-storage-totals on devices list that just adds a column per device for the sum of all that device's archives. Then if someone wanted to get a report broken down by matter Id, they could just filter by matter_id and sum the total in the csv report.

Ok, devices it is. The device pull actually includes userUid without the username option, so it's even easier to merge in the legal hold data.

There are two main things that we want this to accomplish:

Limit the device list to devices on a specific legal hold matter
devices --matter-uid - option to limit device search base on a specific legal hold matter, adds in columns such as legalHoldStatus, legalHoldMatterId, legalHoldMatterName

Add in legal hold membership columns for whatever device query you're running.
devices -include-legal-hold-membership - option to option to include legal hold membership on to the devices table, adds same columns as above but doesn't limit the population

This feels a little redundant, so any feedback is appreciated.

As for storage-totals: is a sum of all device's archives appropriate? (this is how I wrote my first draft). However, the current legal hold script does some logic to determine which destination is the most recent and complete, and only uses that archive size as the total storage.

I think that since filtering by matter-id outside of the CLI is simple enough (i.e. grep/Select-String or directly in excel), can we leave that to the end-user? I'd like to not have our option list expand too much. And I have some ideas for more generic column/row filtering logic that could work on any cmd that outputs a dataframe. Something like `--filter 'matter-id == 42' that would convert to a pandas expression under the hood. Although I haven't started trying to implement that so I'm not sure how feasible it is...

Also, because a device can be part of multiple legal holds, we need to think of how to display that. Can we just have a single column that has a list of matter-uids the device is part of, maybe separated by semi-colons or something?

Yeah, for storage totals I think adding a usedStorage column that sums the total combined archive size for a device would work, maybe we could also add an archiveCount column that gets added with that option?

Trying to keep what features we add as generic and multi-use-case friendly as possible.

src/code42cli/cmds/legal_hold.py

antazoey · 2021-01-04T17:14:41Z

src/code42cli/cmds/legal_hold.py

+        )
+        if len(devices_dataframe.index) > 0:
+            echo("\nMatter Members and Devices:\n")
+            click.echo(devices_dataframe.to_csv())


We typically use table format for show commands.

List commands, however, are known to support various output formats, inc. CSV.... My first comment on this PR suggests that this would make sense as a code42 devices list --matter-id <UID> command, which then you can tack on -f CSV or -f JSON etc.... Would that be better?

src/code42cli/cmds/legal_hold.py

antazoey · 2021-01-04T17:22:52Z

tests/cmds/test_legal_hold.py

@@ -191,6 +459,11 @@ def preservation_policy_response(mocker):
    return _create_py42_response(mocker, POLICY_RESPONSE)


+@pytest.fixture
+def devices_list_generator(mocker):
+    return [TEST_DEVICE_PAGE]


I find it is better to actually use generators for mocks when they are supposed to be rather than lists.. It saves future headaches, in my experience.

Might need to use side_effect instead of return_value in that case, not sure

…tests, update CHANGELONG

maddie-vargo · 2021-01-29T23:07:12Z

I have committed new changes that remove the edits to legal-hold and move it all to the devices command.

If a device is on multiple legal holds, I concatenate the IDs, names, and status into a comma-separated list.
Since legal hold membership is determined at the user level, deactivated devices can print legal hold membership. It did not seem meaningful to present that for deactivated devices, so I cleared those cells.

As for the used-storage piece - this data is present if the user includes the --include-backup-usage option. In efforts to keep this as generalized as possible, is it too much to add in a --include-total-storage option that would bring in a column that totals device storage across destination? Re: Tim's comment above, this calculation could be left to the end user.

kiran-chaudhary · 2021-02-05T09:49:35Z

Can we update the description please based on latest changes. I thought to try it out reading the description and then looking at the code it confused me more.

src/code42cli/cmds/devices.py

timabrmsn · 2021-02-05T13:07:21Z

src/code42cli/cmds/devices.py

    if df.empty:
        click.echo("No results found.")
    else:
        formatter = DataFrameOutputFormatter(format)
        formatter.echo_formatted_dataframe(df)


+def _add_legal_hold_membership_to_device_dataframe(sdk, df):
+    matters = _get_all_active_matters(sdk)


I agree with Kiran above that we probably shouldn't use the legal-hold module's internal functions, just for the sake of if someone later needs to change them it could break stuff here.

Since we only need to iterate through all membership dicts, we can simplify getting them all with a generator function instead of building a new list and extending it:

def _get_all_active_hold_memberships(sdk): for page in sdk.legalhold.get_all_matters(active=True): for matter in page["legalHolds"]: for _page in sdk.legalhold.get_all_matter_custodians( legal_hold_uid=matter["legalHoldUid"], active=True ): for membership in _page["legalHoldMemberships"]: yield membership

And pandas has a great function that can 'flatten' nested json into a dataframe for you, which can simplify a lot of this logic:

>>> df = pandas.json_normalize(_get_all_active_hold_memberships(sdk)) >>> df.head() legalHoldMembershipUid active creationDate legalHold.legalHoldUid legalHold.name user.userUid user.username user.email user.userExtRef 0 955061003769628092 True 2020-05-20T15:58:38.823-05:00 955060991270602172 Test2 945056771151950748 test@example.com test@example.com None 1 988814086101244865 True 2021-01-08T11:25:23.645-06:00 946193444244471969 Test 973619347957360307 legalhold@code42.com legalhold@code42.com None 2 955320624671252917 True 2020-05-22T10:57:44.903-05:00 946193444244471969 Test 955163198918558214 de60@example.com de60@example.com None 3 955320624637698485 True 2020-05-22T10:57:44.892-05:00 946193444244471969 Test 955163197341499910 de59@example.com de59@example.com None 4 955320624604144053 True 2020-05-22T10:57:44.885-05:00 946193444244471969 Test 955163195865104902 de58@example.com de58@example.com None

json_normalize is great for unpacking the nested dictionary. However, I seem to be losing records (1 per matter). If I run the generator response table through pandas.DataFrame, I get a table with more rows than I do running the same generator response through json_normalize. I haven't quite worked this out yet

Interesting, I'll see if I get the same behavior and try to figure out what's happening too.

I had to serialize the generator into a list in order for all records to be included, which may defeat the purpose of the generator. I pushed those changes moments ago.

Looks like there was a recent pandas bug about the generator thing (they weren't expecting generators to be passed): pandas-dev/pandas#35923

It's fixed in 1.2.2, but I think we can probably just call list() on the result for now instead of force updating to the latest pandas.

timabrmsn · 2021-02-05T13:10:22Z

src/code42cli/cmds/devices.py

+    df.loc[
+        df["status"] == "Deactivated",
+        ["legalHoldMemberActive", "legalHoldUid", "legalHoldName"],
+    ] = np.nan


Instead of using np.nan for null values here, let's use an empty string, I think NaN would confuse people in the output and think maybe there was a bug. To replace any np.nan values resulting from the .groupby(), you can call .fillna(value="") on the df to replace those as well.

I chose NaN to match the value assumed by pandas of an empty value. NaN's are generated when the legal hold table is merged with the devices table. If I use the above solution prior to merging with the devices table, the legalHoldUid and legalHold.name columns will contain a mix of None's, actual values, and NaNs.

To confirm: which of the following would you prefer?

NaN's in table view

None's in table view

(Either will produce null values in the other output formats)

Oops, I meant to say "resulting from the.merge()" instead of .groupby(). Yeah, the .fillna(value="") needs to be called after the merge.

I think we should convert NaN/None values to an empty string for table/CSV outputs, but we'd probably want to keep them as NaN/None when we output to json so it translates to null. But we should push that logic to the DataFrameOutputFormatter so we only need to handle it in one place for any commands that use DataFrames.

I'll start a new PR to rework the formatter to handle that, since I'd like to refactor it a bit for clarity and add some tests around handling dataframe nulls.

NaNs removed

Ok, I've opened #245 to add the null handling in the DataFrameOutputFormatter, so we don't need to be concerned with converting them here, since we want the conversion to be conditional on the output format chosen by the end-user.

Sorry to have you flip/flop this!

Resolved, and awaiting #245

src/code42cli/cmds/devices.py

…tput

maddie-vargo · 2021-02-16T22:31:53Z

All PR comments should be resolved. (Waiting on NaN resolution in #245)

The PR does not include a per-org storage calculation. The --include-backup-usage options reports archive size per archive. Is it preferred to create another option for --include-storage-totals, or rely on the user to calculate these totals outside of the CLI?

timabrmsn · 2021-02-17T19:58:48Z

Hmm, it seems that the --include-backup-usage logic was broken somewhere, since that option now doesn't actually add anything to the output..

maddie-vargo · 2021-02-17T21:25:13Z

I'm not able to recreate that behavior - the option spits out backup usage data on my branch. Diffing with the upstream code42cli master doesn't show any changes to backup usage logic.

Maybe something changed in py42? I'm using py42 v. 1.11.0 on my end

timabrmsn · 2021-02-17T21:41:05Z

Nope, it was just a device with a really large backup usage result towards the bottom of the list that was pushing that column way to the right.

I think a separate option + column for total storage a device is using would be really helpful. I also am thinking --include-backup-usage would probably be better turned into a separate command like list-archives, where it's one row per archive like our list-backup-sets. Right now parsing any specific data out of the backupUsage column is difficult for our end-users.

But that can be a separate PR...

I know a big part of the goal of this PR is to enable legal hold storage reporting, right? If we add an --include-storage-totals option that ensures include_backup_usage is True on the api call (but not necessarily on the table output), and then extracts and sums the total archiveBytes for each device and sticks it in a totalUsedStorage column, is that close enough for your customers who want these reports?

timabrmsn · 2021-02-18T16:27:48Z

@maddie-vargo PR #245 just merged

…of archives and archive bytes

maddie-vargo · 2021-02-18T17:18:36Z

--include-total-storage option now pushed up

src/code42cli/cmds/devices.py

timabrmsn · 2021-02-19T16:40:17Z

I think this is good, just the small thing of filtering out V2 archives from the storage totals. I'll write up a ticket for testing with all the acceptance criteria and make sure I haven't missed anything else as I go through documenting each change.

timabrmsn · 2021-02-19T18:46:49Z

json_normalize names the unpacked nested columns with dotted notation legalHold.legalHoldUid, legalHold.name can we rename this to just legalHoldUid and legalHoldName?

timabrmsn · 2021-02-23T13:56:09Z

src/code42cli/cmds/devices.py

@@ -466,8 +471,9 @@ def _break_backup_usage_into_total_storage(backup_usage):
    total_storage = 0
    archive_count = 0
    for archive in backup_usage:
-        archive_count += 1
-        total_storage += archive["archiveBytes"]
+        if archive["archiveFormat"] == "ARCHIVE_V1":


This should be if archive["archiveFormat"] != "ARCHIVE_V2":, since while V3 archives are not prevalent, they still should show up in the results.

kiran-chaudhary

Apart from change-log comment LGTM.

kiran-chaudhary · 2021-02-25T05:42:20Z

CHANGELOG.md

+### Added
+
+- `code42 devices list` option:
+    - `--include-legal-hold-membership` to add legal hold columns to the result output, printing the legal hold matter name and ID for any active device actively on legal hold


can we rephrase printing .... to something like prints the legalhold matter name and ID for any active device on legal hold!?

atleast active device actively seems off while reading.

yes, that is terrible phrasing. Resolved with latest commit

antazoey · 2021-02-26T15:03:07Z

CHANGELOG.md

@@ -58,6 +58,7 @@ how a consumer would use the library (e.g. adding unit tests, updating documenta
    - `search` to search for audit-logs.
    - `send-to` to send audit-logs to server.

+


Looks like our changelog has gotten messed up? The #Unreleased tag should not be there...

@unparalleled-js I re-pulled from master. The #Unreleased tag should be at the top, right? Let me know if I should fix it with this PR

… master

maddie-vargo added 3 commits December 28, 2020 14:38

Legal Hold work to meet Issue 176

b285a68

Fix to Changelog

ed72879

Minor fix to CHANGELOG

ea4381a

maddie-vargo and others added 3 commits December 29, 2020 09:18

Added to legal hold user guide

a0ebe57

Adjusting build parameters to bypass 3.5 for this PR

077dea1

Merge branch 'master' into iss176

03a3814